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Abstract 

This work introduces tire IB-score, a family of independence-based score functions for ro- 
bust learning of Markov networks independence structures. Markov networks are a widely 
used graphical representation of probability distributions, with many applications in sev- 
eral fields of science. The main advantage of the IB-score is the possibility of computing it 



< 

t/3 , without the need of estimation of the numerical parameters, an NP-hard problem, usually 

I solved through an approximate, data-intensive, iterative optimization. We derive a formal 

expression for the IB-score from first principles, mainly maximum a posteriori and condi- 
tional independence properties, and exemplify several instantiations of it, resulting in two 
novel algorithms for structure learning: IBMAP-HC and IBMAP-TS. Experimental re- 
sults over both artificial and real world data show these algorithms achieve important error 
, reductions in the learnt structures when compared with the state-of-the-art independence- 

ff^ ' based structure learning algorithm GSMN, achieving increments of more than 50% in the 

amount of independencies they encode correctly, and in some cases, learning correctly over 
90% of the edges that GSMN learnt incorrectly. Theoretical analysis shows IBMAP-HC 
proceeds efficiently, achieving these improvements in a time polynomial to the number of 
random variables in the domain. 

Keywords: Graphical Models, Structure Discovery, Independence-based , Score-based, 
Markov networks. Reliability. 



1. Introduction 

The present work presents a novel approach for the problem of learning the indepen- 
dence structure of a Markov network (MN) from data by taking a maximum-a-posterior i 
(MAP) Bayesian perspective of the independence-based approach ( Spirtes et al. . 200d ) . 



Independence-based algorithms learn the structure by performing a succession of statistical 
tests of independence among different groups of random variables. They are appealing for 
two important reasons: they are amenable to proof of correctness (under assumptions, see 
more details on these and other Markov network concepts in the next section below); and 
they are efficient, reaching time complexities of O(n^) in the worst case, with n the number 
of random variables in the domain. This runtime is an overwhelming improvement over 
score-based approaches such as maximum-likelihood, that has super-exponential runtime 
(more later). 



1 



Unfortunately, the above holds only in theory, as the correctness of the algorithms, i.e., 
the guarantee they produce the correct structure, is compromised when the independence 
decision of the statistical tests is unreliable, an almost certainty when these tests are per- 
formed on data. Moreover, as we discuss in more detail in the following sections, for tests to 
be reliable they require data with size exponential in the number of variables n.To address 
these important concerns, the present work focuses on algorithms for improving the quality 
of independence-based approaches, while maintaining a manegeable runtime. 

Our main contribution is the IB-score, the result of a hybrid approach between indepen- 
dence and score based a pproaches, with improvements oy er both. On one hand, generalizing 
on previous works, e.g. ( Margaritis and Brombere . 20091 ). it improves the quality of existing 
independence-based algorithms by taking a Bayesian, MAP approach, modelling explicitly 
the posterior distribution of structures given the data. On the other, it can be computed 
efficiently, contrary to existing scores, as it does not require estimating the model param- 
eters, an NP-hard computation usually approximated through a data-intensive iterative 
algorithm. This is achieved by first modelling the uncertainty in the independencies as 
random variables, and then expanding the posterior over these independencies, an opera- 
tion that results in an expression for the posterior of the structures dependent only on the 
posterior of independence assertions, that can be computed efficiently (c.f. 

We now proceed in the next Section [2] to discuss in more detail many of these concepts 
and algorithms, including a thorough literature review. Following, Section [3] formalizes and 
derives the IB-score from first-principles. Section |4] introduces two algorithms IBMAP- 
HC and IBMAP-TS, each exemplifying two different possible optimization searches for 
the MAP, and two different possible instantiations of the IB-score. Then, in Section [5] 
we present experimental results that confirm the robustness of IBMAP-HC and IBMAP- 
TS, and the polynomial runtime of IBMAP-HC. To conclude, we present a summary and 
possible directions of future works in Section [D 



2. Markov networks 

In this section we describe and motivate MNs in more detail, comparing various existing 
approaches for learning an independence structure from data (Section 12. ip . expanding on 
the deficiencies of the approaches described above, and how our contribution addresses these 
deficiencies successfully. 

The innovation rate of computing and digital storage capacity is increasing rapidly as 
time passes, producing an important increase in the data available in digital format. As 
a response, the community of data mining and machine learning is constantly producing 
novel and improved algorithms for extraction of the knowledge and information implicit in 
this data. Without a doubt, the well-established probabilistic theory is a powerful mod- 
eling tool contributing to most of these algorithms, speciall y when the data is uncertain 
i.e., measurements a r e nois y). Probabilistic graphical models ( Pear j 19881 : Lauritzen . Il996l : 



Roller and Friedmanl . [2OO9I ) are a family of multi-variate, efficient probability distributions 



that, by codifying implicitly the conditional independences among the random variables 
in a domain, produce sometimes exponential reductions in storage requirements and time- 
complexity of statistical inference (more details below). These efficiencies are the reason 
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that probabilistic graphical models (or simply graphical models) has had an increasingly 
important role in the analysis and design of machine learning algorithms in recent years. 



199ll : lAnguelov et al 



Examples of successful applications include: computer vision (jBesag et alj 

2OO5I ) ■ restoration of noisy im ages, texture classifica tion and image segmentation; genetic re- 
search and disease diagnosis dFriedman et al.l.l2000ll: spatia l data mining such as geography, 



transport, ecology, and many others ([Shekhar et al.1 . 12004 ) . 



A graphical model is a probabilistic model over the joint set V of random variables. It 
consists in a set of numerical parameters 0, and a graph G which encodes — compactly — 
independences among all the random variables of the domain. The edges of G represent 
explicitly all the possible probabilistic conditional independences among the variables, thus 
called sometimes the independence structure or simply structure of the domain. We are 
interested in the problem of learning a graphical model from data, that consists in learning 
both the independence structure G, and the numerical parameters 0, as illustrated in Fig.[TJ 
As in any statistical modeling process, we assume the data available (Fig. [H center) is a 
sampling of some unknown underlying probability distribution (Fig. [H left), and learning 
consists in producing a structure G and parameters (Fig. [H right) that best fit the data, 
with the hope that this model matches the underlying distribution. 



The underlying model 
(unknown in real world) 



A data set sampled 
from ( G , e ) (known) 
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Figure 1: Outline of the problem: we have at our disposal a dataset (center), assumed to be 
a sampling of some unknown underlying probability distribution (left). Learning consists 
in analyzing the input dataset to produce a model distribution (left). Learning consists in 
analyzing the input dataset to produce a model that best fits it (right), with the hope that 
it matches the underlying model. 

There are two predominant types of graphical models: Bayesian networks (BN) and 
Markov networks (MN). For BNs, G is a directed and acyclic graph, and for MNs, G 
is an undirected graph. These two families differ in the independencies they can encode. 
Not all sets of independences — and therefore not all the probability distributions — are 
representable by graphs, and moreover, some sets of independences — and therefore some 
distributions — are only representable by a BN, and others only by a MN (and some exists 
that cannot be represented by neither a BN nor MN). For the case of MNs (our focus in 
this work), distributi ons that can be faithfully representable by undirected graphs are called 
graph-is omorph (see ( Pearl . 19881 ) for more details). A common practice, and one we follow 



in this work, is to assume that the underlying distribution (of which the input data is a 
sample) is graph-isomorph. 



3 



In what follows, we use capitals X , Y , Z , . . . to denote random variables, and bold 
capitals X, Y, Z, . . . to denote sets of random variables. We denote by V the set of all the 
random variables in the problem domain, and its cardinality by n, i.e., | V |= n. In this 
work, we assume that all variables in V are discrete. We denote the input dataset by T>, 
and its cardinality (i.e., the number of data points) by N. As we said before, a MN consists 
in a set of numerical parameters 0, and a graph G that encodes conditional independences 
among the random variables contained in the set V. We use the notation I{X, y | Z) to 
denote the conditional independence predicate I over the triplet {X, Y \ Z) which is true 
(false) whenever X is independent (dependent) of Y, given the set of variables Z. The 
graph G consists of n nodes, one for each X € V, and a set of edges E{G) between this 
nodes that encode the set of all probabilistic conditional independences among the random 
variables in V. Fig. [2] shows an example of a MN. The encoding of independences in a MN 
is as follows: 



{x,Y)&E{G) ^ vzc {v-{x,y}},^/(x,y I z) 



(1) 



In other words, an edge between two variables X and Y encodes the fact that X is 
dependent of Y , conditioned on any set Z C {V — {X, Y}}, and the lack of an edge between 
variables X an d Y m e ans th at 3Z C {V - {X,Y}} that satisfies I{X,Y \ Z). 

As shown in Pearl (| 19881 ). when the underlying model is graph-isomorph, the encoding is 
equivalent to reading independences through vertex- separation, i.e., two variables X and Y 
are independent (dependent) according to the graph given the set Z, if and only if variables 
X and Y are disconnected (connected) in the sub-graph of G resulting from removing all 
edges incident in a variable in Z. In other words, the set Z intercepts (does not intercepts) 
every path from X to Y . To illustrate, consider variables and 5. These variables are 
dependent given {2} because there is still a path through variable 3, and thus independent 
given {2, 3}. 




Figure 2: An example Markov network with 7 nodes. 



As mentioned, the strength of graphical models, both MNs and BNs, lies in the some- 
times exponential reduction in storage capacity and time complexity of statistical infer- 
ence. Storage and statistical inference (mostly marginalization) is exponential because the 
representation of a multi-variate joint probability distribution over V consists in a multi- 
dimensional table with n + 1 columns, and exponentially many tuples, each consisting in 
a complete assignments of the n variables, and one column to store the probability of that 
assignment. Fig. [3] (left) illustrates the table for the joint probability of a system of n binary 
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variables (i.e., that can take any of two values), consisting on 2" tuples, one per configura- 
tion. The exponential reduction in storage and inference occurs because certain conditional 
independences allow a decomposition of the joint probability into a product of factor or 
potential functions, each quantified by the parameters 0, and dependent on only a subset 
of V. For instance, it is a well-known fact that when all n variables in V = {Vi,..., V^} 

n 

are mutually independent, Fi{Vi, ...,Vn) = H Pr(Vi). Since the tables associated with a 

i=l 

function over some set of variables is exponential in the number of variables in that set (as 
exemplified earlier in this paragraph for the case of the joint), the decomposition requires 
then a polynomial number of (exponentially) smaller tables. Moreover, it can be shown 
that in many cases the number of variables in each factors is bounded by a small number, 
resulting in the exponential reduction in both storage and marginalization. To continue 
with our example, the decomposition of the n binary variables results in n tables with only 
2 tuples each, as illustrated in Fig. [3] (right). An important difference between MNs and 
BNs lies in the properties of their factor functions. The factor functions of MNs are not 
normalized, thus, in order to obtain a fully quantified probabilistic model, an exponential 
computation of a normalization constant (known as the partition function) over all possible 
assignments of the n variables is required. BNs, instead, allow a factorization of the joint 
distribution into conditional probability distributions, i.e., normalized factors, that can thus 
be learned efficiently from data (does not require normalization). 
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Figure 3: Storage reduction over n binary variables from 2^ to 2n. 



2.1 Structure learning algorithms: background and related work 



In this section we discuss the problem, and motivate our approach for solving it, in the 
context of existing algorithms for learning structure from data. 

Historically, the literature has presented two broad approaches for learning MNs: score- 
based algorithms, and independe nce-based (also called con s trained-based) alg o rithms. Score- 

based algorithms, exemplified by Lam and Bacchus ( 1994 ): McCallum ( 20031 ): Acid and de Campos 
(2003 ). perform a search on the space of all possible graphs to find the structure with maxi- 
mum score. These algorithms differ in the approach taken to explore the space of all graphs 
(super-exponential in size, e.g., there are 2^" choose 2) undirected graphs with n nodes), and 
in the functional form, and theoretical justification, for the score. Example s of scores ar e 
maximum likelihood Pr(T> \ G, 0) of the data "D given the model (0,G) (jBesad . \mi ). 
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minimum description length of the model (&.G) (ILam and Bacchua. ligO^ ). or pseudo- 
likelihood ( Besaj . 1975 : Yu and Cheng , 2003 : Koller and Friedman . 2009 )). again of the V 
given the model (0,G). The last approach, pseudo-likelihood, introduces an approximate 
expression for computing the likelihood that does not require the normalization of the model 
(i.e., the computation of its normalization constant). To compute the score of each struc- 
ture G "visited" during the search, it is nec essary to estimate the parameters 0. For BNs 



3_pa 

this operation can be performed efficiently (i Heckerman et al.l. Il995ll. but for MNs this is a 



costly operation that consis ts in an optimization in itself (|Besad . ll974l : IYu and Che^ . l2003l : 
Koller and Friedman . 20091 ) . For instance, in the case of the likelihood, the optimization 
would be over the space of parameters, for a fixed structure G. We will later show that our 
approach, although it requires a search over the space of structures, can compute efficiently 
the (independence-based) score at each step as it does not require the estimation of the 
parameters. 



In dependence-based algo r ithms (ISpirtes et al.l.l2000l: lAliferis et al.l . l2003l : iTsamardinos et al.l . 



20031 : iBromberg eHU . l2009l : iMargaritis and Brombergj . l2009l ) have the ability to learn the 
independence structure efficiently, as they do not require neither a search over structures, 
nor the estimation of parameters at each step of their execution (of course, in order to ob- 
tain a complete model {G, &) they do need to estimate the parameters once the structure 
has been learned). Instead, these algorithms take a rather direct approach for construct- 
ing the independence structure by inquiring about the conditional independences that hold 
in the input data. This inquiry is performed in practice through sta tistical test s such 
as Pearson's test ( Agresti . |20o3 ). more recently the Bayesian test ( Margaritii . 20051 : 
Marg:aritis and BrombereT 20091 ). and for continuous Gaussian data the partial correlation 



test (ISpirtes et al.l . 12000), among others. At each step, these algorithms propose a triplet 



{X, Y \7a) and inquire the data whether independence or dependence holds for that triplet. 
Which triplet is proposed at some step depends on the independence information the al- 
gorithm has up to that point, that is, which triplets has been proposed so far, and their 
corresponding independence value. To find the structure consistent with this independence 
information, these algorithms proceed by discarting, at each step, all those structures that 
are inconsistent (i.e., do not encode) the independence just learned, until all but one struc- 
ture has been discarded. Algorithms exist that require a number of tests that is polynomial 
in the number n of variables in the domain, which, together with the fact that statisti- 
cal tests can be executed in a running time proportional to the number of data points N 
in the dataset, result in a total running time that is polynomial in both n and N . On 
top of their efficiency, another important advantage of these algorithms, is that, under as- 
sumptions (statistical tests are correct, and the underlying model is graph- isomorph) , it is 
possible to prove the correctness of these algorithms, i.e., that they return the structure 
of the underlying mod el. For a thorough example of such proof we refer the reader to 



Bromberg et al 



(|2009l ). Unfortunately, statistical independence tests are not always reli- 
able. Most independence-based algorithms are oblivious to this fact, evident from their 
design, which discards structures based on a single test. If the test is wrong (i.e., it asserts 
independence when in fact the variables in the triplet are dependent, or vice versa), the 
underlying (true) structure may be discarded. This problem is not to be underestimated be- 
cause the quality of statistical independence te sts degra,des ex ponentially with the number 
of variables involved in the test. For example, (jCochranl . Il954l l recommends that Pearson's 
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independence test be deemed unreliable if more than 20% of the cells of the tests con- 
tingency table have an expected count of less than 5 data points. Since the contingency 
table of a cond itional independence test over {X, y | Z) is d-dimensional, with d = 2+ | Z | 
(jAgrestil . |2002| ). the number of cells, and thus the number of data points required, grows 
exponentially with the size of the test. In other words, for a fixed size dataset, the quality 
of the test degrades exponentially with the size of its conditioning set. 

This problem has been addressed by Bromberg and Margaritis ( 20091 ) using Argumen- 
tation. Their approach modeled the problem as an inconsistent knowledge base consisting 
on Pearl's axioms of conditional independence as rules, and the triplets and their corre- 
sponding assignments of independence values as fact predicates. If the und erlyin g mode l 
is graph-isomorph, it s set of independencies must satisfy the axioms (see Pearl ( 19881 ): 
Koller and Friedman ( 20091 ) for details). In practice, however, independencies are not 



queried directly to the underlying model, but to its sampled dataset. These independencies 
may be unreliable, and thus may violate the rules. A rgumentation is an infereri c e proc edure 
designed to work in inconsistent knowledge bases. Bromberg and Margaritis (j2009l ) used 
this framework to infer independencies, sometimes resulting in inferred values different than 
measured values. Taking the inferred value as the correct one, they obtained important im- 
provements in the quality of the reliability of statistical independence tests, and thus of the 
independence structures discovered. 

We present here an alternative approach to th e unre liability problem. Our approach is 
inspired by the work of Margaritis and Bromberg ( 20091 ). that designed an algorithm that 
instead of discarding structures based on outcomes of statistical tests, maintains a distribu- 
tion over structures conditioned on the data, i.e., the posterior distribution of the structures. 
To learn the structure, the algorithm takes the maximum-a-posteriori approach. In the next 
section we present the IB-score (independence-based score) , an expression for computing ef- 
ficiently the posterior of some structure based on outcomes of statistical independence tests, 
and in the following sections we use this score in two different algorithms that conduct the 
search for its maximum. The resulting algorithms are hybrids of score and independence 
based algorithms. Score-based because they proceed by maximizing a score, i.e., the poste- 
rior of the structures given the data or as we call it, the IB-score; and independence-based, 
because the IB-score is computed through statistical tests of independence (thus its name). 
However, contrary to previous score-based approaches, the computation of the score (the 
IB-score) does not require the estimation of the numerical parameters and is therefore 
efficient. 



3. Our approach: independence-based MAP structure learning 

As discussed in the previous section, independence-based algorithms has the advantage of 
not requiring neither a search over structures, nor an interleaved e stimation of the param- 
eters estimation at each iteration of the search (a search in itself, (|Yu and Chengl . |2003| )). 
Unfortunately, these advantages are compensated by a major robustness problem as the 
vast majority of the independence-based algorithms rely blindly on the correctness of the 
tests they perform, with the risk of discarting the true, underlying structure when the test 
is incorrect. This problem is exacerbated by the fact that the reliability of statistical in- 
dependence tests degrades exponentially with the size of the conditioning set of the test 
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(for some fixed size dataset). To over come this r obustness problem we take a Bayesia n 
approach, inspired and extended from ( Margaritid . 2005 : Margaritis and Brombere . 20091 ) . 



This approach models the problem of MN structure learning as a distribution over structure 
given the data, i.e., the posterior distribution of structures. Under this model, structures 
that are inconsistent with the outcome of a test are not discarded, but their probability is 
reduced. The approach taken is a combination of score and independence based algorithms 
for structure learning. Score-based because we search for the structure G* whose poste- 
rior probability Pr(G | T>) is maximal, that is, we take the maximum-a-posteriori (MAP) 
approach: 

= argmaxPr(G I P), (2) 

G 

and independence-based because the expression obtained for computing the posterior 
probability is based on the outcome of statistical independence tests, as explained in detail in 
the following section. Later, in section[4]we present two practical algorithms for conducting 
the maximization: IBMAP-HC (Independence-based maximum a posteriori hill climbing) 
and IBMAP-TS (Independence-based maximum a posteriori tree search). 

3.1 Independence-based posterior for MN structures: the IB-score 

In this section we present the IB-score, a computationally feasible expression for the pos- 
terior Pr(G I T>). We proceed by re-expressing the posterior of structure G in terms of 
the posterior of a particular set of independence assertions C{G) we call the closure, and 
then, through approximating assumptions, obtain an efficiently computable expression for 
the posterior we call the IB-score. 

Let us first define formally the closure C{G) of a structure G: 

Definition 1 (Closure). A closure of an undirected independence structure G is a set C{G) 
of conditional independence assertions that are sufficient for determining G. 

We can illustrate this definition by a couple of examples: 

Example 1. According to Eq. we can construct a closure for G by adding the assertion 
I{X, Y I V— {X, Y}) for every pair of variables X,Y & \' , X ^ Y , for which there is no edge 
inG, i.e., {X,Y) ^ E{G). Otherwise, if there is an edge between them, i.e., {X,Y) E E{G), 
add the assertions -^I{X,Y \ Z) for every Z C V — {X, y}. 

Example 2. Our second example considers any independence-based algorithm for which 
exists a proof of correctness; under the usual assumptions of faithfulness and correctness of 
the independence assertions. That is, proof that when provided with correct independence 
values for each independence inquiry, the structure G* output by the algorithm is the only 
one consistent with those values. Being G* the only possible structure for those indepen- 
dencies, the independencies determine G* , and they conform a closure. Examples o f correct 
algorithms for MNs are the GSMN and GSIMN algorithms i Brombera et al. . 200^ ). 



We now prove an important Lemma that asserts that the posterior Pr(G | P) of a 
structure exactly matches the posterior Pr(C(G) | V), for any closure of G C{G). 
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Lemma 2 (Probabilistic equivalence of structures and closures) . For any undirected inde- 
pendence structure G and any closure C{G) of G, it holds that 

Pr(G I V) = Pr(C(G) | V). 

Proof The proof proceeds by incorporating, in two steps, the information of independence 
assertions contained in the closure C{G) of G into the posterior Pr(G | V). First, we 
model as random variables the uncertainty of independence assertions obtained through 
unreliable statistical independence tests. Formally, the uncertainty in the independence 
assertion I{T) = t, denoting that the triplet T is independent (dependent) when t = true 
{t = false), is formalized by a random variable T taking the values t € {true, false} 
(note the notation overload). Second, for each independence assertion (/(Tj) = ij) G C(G), 
i = 1, . . . ,c, and c = |C(G)|, we incorporate Tj (its corresponding random variable) into the 
posterior using the law of total probability, namely: 

Pr{G\V)= J2 E ■■■ E P<G,Ti = ti,T2=t2,...,T, = tc\V) 
ti6{t,f}t2e{t,f} tce{t,f} 

where t and f are abbreviations of true and false, respectively. We simplify this expres- 
sion by first collapsing the c uni-dimensional summations over {t, /} into a single, multi- 
dimensional summation over {t, fY, then abbreviating Tj = ti by ti, and then {ti,t2, . . . ,tc} 
by ti:c, obtaining 

Fr{G\V)= J2 Pr(G,ii:e|P). 
ti:ce{t,f}= 

Applying the chain rule to the terms in the summation we obtain 

Pr(G \V)= Yl I ^) I 25) 

ti:ce{t,f}= 

Now, by definition of a closure, a structure G is determined by its independence asser- 
tions, i.e., the probability of G given these assertions (and any other variable such as V), 
must equal 1. Moreover, if we flip the independence value of any of these assertions, then 
it is clear that G cannot be the structure, in other words, its probability given the flipped 
assertions (and any other variable such as V) must be 0. This results in the left factor 
Pr(G I ti:c, V) in all terms of the summation being 0, except for the term containing the as- 
signments consistent with the closure (where it is 1). If we denote by {tf, tf^, . . . , t^} = tf.^ 
these assignments, we get 

Pr(G I V) = Fiitf,, I P), 
and the lemma is proved by noticing that tf.^ is not more than the closure C{G). ■ 

Unfortunately, to the best of the authors knowledge, there is no method for computing 
the joint Pr(C(G) | D) of many independence assertions. We are forced then to make the 
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approximation that the independence assertions in the closure are all mutually independent. 
This gives us the IB-score(G) of G: 

P^(C(G) I V) = IB-Score(G) = JJ Pr(r = t\ V) (3) 

(/(T)=t)GC(G) 

which can now be computed usi ng the Bayesian test of independence of Margaritis ( 20051 ): 
Margaritis and Bromberg ( 20091 ) to compute the factors Pr(T = t\'D). 



The approximation implies that our belief in the independence of any triplet T in the 
closure is not affected by our knowledge that some other triplet T' in the closure is inde- 
pendent (or dependent). This my not be true in practice. It is certainly false when the 



two triplets are related through Pearl axioms of independence (jPearll . Il988l ) . In that case 
they determine each other, i.e., Pr(T | T',2?) = 1, and thus their joint probability is easily 
computable: Pr(T, T' \ V) = Pr(r | T'V) x Pr(T' | V) = Vi{T' \ V). For triplets that 
are not related through Pearl axioms we are forced into this approximation until a method 
for computing the joint over several independence assertions is developed (if this is at all 
possible). 

To conclude this section we note that we had made a single approximation, namely, that 
the random variables of independence assertions are mutually independent. The expression 
found can be computed efficiently, with a runtime complexity proportional to the sum of the 
complexities of t he statistical tests perforni e d for computing the probability Pr(r = t | P) 



of each factor. In Bromberg and Margaritis ( 20091 ) it is argued that the computational cost 



of performing a statistical test on data for the triplet {X, y | Z) is proportional to the 
number of data points in the dataset, i.e., A'^, times the total number of variables involved 
in the test, i.e., 2 + |Z|. So in the worst case, the complexity of the IB-score would be 

0(|C(G)|.iV.r*), (4) 

where r* = max7igc((j) |T|, i.e., the number of variables in the triplet, among all triplets in 
the closure, with the maximum number of variables. 



4. Practical algorithms for MAP optimization 

A full specification of the MAP search of Eq. ([2]) requires the specification of the search 
mechanism and the closure chosen for the structures. We present in what follows two 
approaches: IBMAP-HC (Independence-based maximum a posteriori hill climbing) and 
IBMAP-TS (Independence-based maximum a posteriori tree search) that differ in both 
the search mechanism (hill climbing and tree search, respectively), and in the choice of 
closure (Markov blanket based, and algorithm based, respectively). The motivation for 
these choices is two-fold. First, they serve as more realistic examples of closures than those 
given in Section 13.11 and second, the more implementations of the IB-score, more robust 
the experimental conclusions. Let us discuss in detail each of these approaches. 



4.1 IBMAP-HC: Independence-based MAP structure learning using hill 
climbing and Markov based closure. 

We proceed now with the specification of the closure and search procedure chosen for 
IBMAP-HC. 
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Given a domain of variables V, n = |V[, we propose the hill climbing local search 
mechanism for finding the best structure in the space of all undirected structures of size n. 
Starting from some structure G, the search proceeds finding the structure G' with maximum 
IB-score among all structure one edge-flip away from G. An edge-flip of some pair of 
variables {X, Y), consists on removing the edge {X, Y) from G, if such edge exists, or adding 
it otherwise. The amount of neighbors thus equals the number of pairs of variables, which is 
n(n — l)/2. The algorithm continues recursively from G' , until all structures one flip-away 
has smaller IB-score, i.e., until the algorithm reaches a loc al maxima. As a starti ng structure 
we chose the structure output by the GSMN algorithm ( Bromberg et al. . 20091 ). This way. 



the hill climbing search can be seen as a perturbation of the output of GSMN, with the hope 
that the local maxima in the proximity of GSMN has better quality. Experimental results 
confirm this is in fact the case. This could be done of course with structures output by any 
other structure learning algorithm, although we expect improvements only on independence- 
based algorithms. Our experiments show results for GSMN only. 

For the closure, we chose one based on Markov blankets. Let us first define the concept 
of Markov blanket and then explain how it can be used to specify a closure. 



Definition (|Pearll (|l988l ). p.97). The Markov blanket of a variable X eV is a setB^ <Z 
V — {X} of variables that "shields" X of the probabilistic influence of variables not in B"^. 
Formally, for every 7^ X G V, 

W (^B^ ^ I{X,W - {X,W}) (5) 

A Markov boundary is a minimal Markov blanket, i.e., non of its proper subsets satisfy 
Eq.(^. 

The substraction of X and Y from in the consequent of Eq. (jS]) is redundant, as 
neither X nor Y are in the blanket, and it is made explicit for later convenience. Also, 
unless explicitly stated, from now on any mention of Markov blanket refers to a minimal 
Markov blanket, that is, to a boundary. 

It can be proven that for Markov boundaries, the opposite direction of Eq. ([5]) also 
holds, that is. 

Lemma 3. For every 7^ X € V, ifB^ is the Markov boundary of X, then 

W ^B^ ^ I{X,W \B^ - {X,W}), (6) 
The proof of this Lemma is discussed in Appendix [Al 

We can now present the Markov-blanket closure used by IBMAP-HC. We do it through 
the following Theorem: 

Theorem 4 (Markov-blanket closure). Let \ be a domain of random variables, G an 
independence structure over V, and let B^ denote the Markov boundary of a variable X G 
V. Then, the set of independences 

Cmb{G) = { i{x, y I B^ - {X, y}) I X, y / X G v, (x, y) i e{g) } (J 

{-/(X, y I B^ - {X, y}) I X, y / X e v, (x, y) g e{G)] (7) 
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is a closure of G. That is, for each variable X and each other variable Y X , if the pair 
(X, Y) is an edge in G then add -'I{X, Y \ B"^ — {X, Y}) to the closure, otherwise add 
I{X,Y\B^ -{X,Y}). 

We need one final result to prove the Theorem, a relation between Markov boundaries 
and independence structures: 



Corollary 2 (jPearll (jl988l ). p. 98). The independence structure G of any strictly positive 
distribution over V can be constructed by connecting each variable X € V to all members 
of its Markov boundary . Formally, 

Vy G ^ {X, Y) G E{G) (8) 

Proof [Theorem 11] 

To prove that Cmb{G) determines G, we must prove the fact that an edge between X 
and Y exists or not is determined by the independencies in Cmb{G). We do it separately 
for existence and non-existence of edges. 

For edge existence: 

Let {X,Y) S E{G). Is this edge determined by C]\jb{G)? By definition of Cmb{G), 
the assertion -'I{X,Y \ B"^ — {X,Y}) must be in Cmb{G). It is sufficient then to prove 
that ^I{X,Y I B^ - {X,Y}) =^ {X,Y) G E{G). This follows from the counter positive of 
Eq. dSD and Eq. ([8]). 

For edge absence: Let {X,Y) ^ E{G). Is this lack of edge determined by Cmb{G)1 
By definition of Cmb{G), the assertion I{X,Y \ B^ — {X,Y}) must be in Cmb{G). It is 
sufficient then to prove that I{X,Y j B^ - {X,Y}) =^ {X,Y) ^ E{G). This follows from 
Eq. dSD and Eq. ©. ■ 



4.1.1 Complexity of IBMAP-HC 

Let's first discuss the computational complexity of IB-score using the Markov blanket clo- 
sure. According to Eq. (j4]), this complexity is 0{\Cmb\Xt*). To obtain |Cmb| note that for 
each variable X in V, and each other variable Y ^ X, there is either an edge or not, but 
both cases cannot happen. Thus, for any pair X ^ Y either the condition on the l.h.s. of 
the union in Eq. ([7D is true, or the condition in the r.h.s. is true, so only one independence 
assertion is added per pair of variables. Therefore, ICa/bI = ^{n — 1). We can also note 
that T* = 2 + IB"'''* I, where X* is the variable with the largest blanket. 

A straight-forward implementation of the hill climbing computes the IB-Score for each 
of the n(n — l)/2 neighbors, resulting in a total complexity of O {n^in - lfNT*/2). There 
is a trick, however, for reducing the complexity of each neighbor's IB-Score by one order 
of magnitude. The difference of each neighbor with the currently visited structure G is 
exactly one edge, say {X, Y). In the two cases of an edge being removed, or an edge being 
added, the blanket of both X and Y would change, and thus the conditionant of all 2(n— 1) 
independence assertions in the closure containing either X or Y. All other independence 
assertions would remain exactly as in G, and thus, an incremental computation of the 
IB-score would require only 2{n — 1) statistical tests, resulting in a time complexity of the 
(incremental) IB-score of O (2(n — 1)Nt*), and a time complexity of one hill climbing step of 
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O (n(n — l)^A^r*). That is, a reduction of a factor n/2 with respect to the non-incremental 
alternative. 

Finally, if we denote by M the total number of hill climbing iterations, the overall 
complexity of the IBMAP-HC algorithm is O (Mn{n — 1)'^Nt*). Unfortunately M is un- 
known, and can only be obtained through empirical measurement. We thus report it in our 
experiments. 

4.2 IBMAP-TS: Independence-based MAP structure learning using tree 
search and GSMN based closure. 

In this section we introduce another algorithm, IBMAP-TS (Independence-based max- 
imum a posteriori using tree search) for learning an independence structure using the 
independence-based MAP approach of Section [3j This algorithm implements the maxi- 
mization through a tree search, the uniform cost algorithm, using an algorithm-based closure 
described briefly above in Example [21 The resulting algorithm is not efficient, requiring an 
exponential number of statistical tests to find the structure with maximum IB-score. How- 
ever, we believe its description here and the presentation of experimental results later, to 
be helpful in two ways. First, the closure presented is generic, in the sense that can be eas- 
ily instantiated for other independence-based algorithm besides the GSMN algorithm used 
here. Also, although the algorithm is exponential in time, the quality improvements ob- 
tained through experimentation help to reinforce the hypothesis that the IBMAP approach 
improves the quality of the structures learned. 

Let's start formalizing the process followed by an independence-based algorithm. As 
mentioned above in Section 12.11 an independence-based algorithm proceeds as follows: at 
each iteration i, it proposes a triplet Tj and performs a statistical independence test (SIT) 
on data to obtain an independence assertion Tj = tf^^. The superscript SIT is included for 
later convenience, and denotes that the independence value considered is the one indicated 
by the statistical independence test. Which triplet is selected at iteration i depends on 
ri:i_i = tf.^^i, that is, on the sequence Ti-i^i of triplets proposed so far (in iterations 1 
through i — 1), and the independence values tf.^^^ that statistical tests assigned them (with 
^1:0) *f-o"^ indicating the empty set of triplets and assertions, respectively). Each iteration 
of an independence-based structure learning algorithm can thus be summarized as follows: 

Ti ^ Ain.,.,=tf.^,) (9) 
tf^^ ^ SIT{Ti) 

where A denotes the operator that decides the next triplet. 

These algorithms proceed until the set of independence assertions done so far is sufficient 
for determining a structure. By the definition of closure, this set is thus a closure. We call 
the closure obtained in this way an algorithm-based closure. 

How can we then use this closure in a MAP search? At each iteration i, we propose 
considering, besides the independence assertion Tj = tf^"^, the alternative value = -<tf^'^. 
In other words, we propose to distrust the statistical test. Interestingly, this change should 
not affect the algorithm. It would simply continue as if the statistical test would have given 
the other value. Moreover, it would also find a structure eventually, and the assertions 
found during its execution would thus be a closure as well. Clearly, the structure found 
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would be different. In summary, each such bifurcation into Tj = tf^^ and Tj = -^tf^^ 
produces a different family of structures whose posterior can be computed using as closure 
the assertions obtained during its execution. 

To finalize, we can notice each bifurcation splits the sequence of independences in two, 
each of which is recursively split in the next iteration. This can be modeled by a binary tree, 
with each node corresponding to an independence assertion, and whose children correspond 
to the two assignments for the triplet obtained by applying A to the assertions in the path 
from the root to the parent. An exception is the root node, which represents a dummy 
empty sequence node. Each path from the root to a leave, not only determines a structure, 
but it determines a closure {Ti = ti,T2 = t2, ■ ■ ■ ,Tc = tc} for that structure. This closure 
can then be used to compute the IB-score of the structure by multiplying the posterior 

c 

of the independence assertion of each node in the path, i.e., Y\ = ti \ T^)- Under 

■t=i 

this view, finding the structure with maximum IB-score (i.e., the MAP structure) can be 
done through any tree search algorithm using as successor function the operator A, and as 
cost of each action — logPr(Tj = ti \'D). Since the log of a product is the sum of the logs, 

c 

and the log is a monotonous function, finding the maximum IB-score Pr(Tj = ti \ T)) 

i=l 

is equivalent to finding the path with the minimal path-cost, i.e., the sum of the costs of 

c 

nodes in the path or ^ —logPr{Ti = ti I'D). 

We implemented and tested IBMAP-TS with the closure constructed using the GSMN 
algorithm and uniform cost strategy to search the tree. We chose GSMN to demonstrate 
our approach for being a well established independence-base d Markov networks structure 
learning algorithm. This algorithm uses the GS algorithm (jMargaritis and Thrun , 200d ) 
for learning the Markov blanket of every variable X G V, and uses Corollary 2 to build the 
structure. Results comparing the quality of structures learned by GSMN versus IBMAP-TS 
with uniform cost and GSMN based closure are shown in the experimental results section. 
One important advantage of uniform cost is that its solutions are optimal. To improve over 
its exponential runtime, we tested a few heuristics (not reported here), but with no success 
in avoiding exponential runtimes. We thus limit our findings to uniform cost. 

Fig.® shows an example (partial) search tree for a small system of three random 
variables V = {0,1,2}. Each node is annotated with both the triplet and indepen- 
dence assertion of the triplet (e.g., node II is labeled by 1(0,1 | {})), the posterior prob- 
ability of this independence assertion (e.g.,Pr(/(0, 1 | {}) | V) = 0.6), the local cost 
(e.g.,— log(Pr(/(0, 1 I {}) I V)) = 0.511), the partial computation of the IB-score (e.g., 
for II this would 0.4 x 0.6, the product of costs of all nodes from the root to II), and 
finally, the partial path-cost as the minus log of the product (e.g., 1.427 for 1/). The node 
with the checkmark (and underlined) is the one with lowest path-cost, and thus the next 
in line to be expanded by uniform-cost. 



5. Experimental evaluation 

We describe several experiments for testing the effectiveness of the IB-score for improv- 
ing the quality of independence structures discovered by our novel independence-based 
algorithms IBMAP-HC and IBMAP-TS (c.f. The experiments also corroborate the 
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21 

t = 1(0,2 \{}) 
Pr{t I V) ^ 0.4 
cost = 0.916 



II 

t = 1(0,1 \{}) 
Pr(t I V) = 0.6 

cost ^ 0.511 
0.6 X 0.4 = 0.24 



ID 

i = -/(0,l|{}), 
Pr(t I V) = 0.5 

cosi = 0.693 
0.5 X 0.4 = 0.20 



2D / 

t = -/(0,2l{}) , 

Pr(t \v)=qj 
cost = 0.357 




II 

7(0,1 I 2), 
Pr(t I V) = 0.75 

cost = 0.288 
0.75 X 0.7 = 0.53 



ID / 

t ^ -7(0,1 I 2), 
Pr(t I V) = 0.85 

cosi = 0.163 
0.85 X 0.7=0.60 



path-cost= 1.427 path-cost= 1.609 path-cost= 0.647 path-cost= 0.52 

Figure 4: Example partial binary tree expanded by IBMAP-TS with uniform cost search 
and GSMN-based closure. 



practicality of IBMAP-HC, i.e., its polynomial running time, as discussed theoretically in 
Section HXTl 

5.1 Experimental setup 

We discuss here the performance measures we use for testing the quality of output networks, 
and the time complexity of the algorithms, as well the datasets on which these performance 
was measured. 

5.1.1 Datasets 

We ran our experiments on benchmark (real-world) and sampled (artificial) datasets. Real 
world datasets allow an assessment of the performance of our algorithms in realistic settings 
with the disadvantage of lacking the solution network, thus resulting in approximate mea- 
sures of quality. We used the pub licly available benchmark datasets obtai ned from the UCI 
Repo sitories of machine learning (|A. Asuncionl . 120071 ) and KDD datasets fflettich and Bayi . 



Artificial datasets, although more limited in the scope of the results, are sampled 



from known networks allowing a more systematic and controlled study of the performance 
of our algorithms. 

Using a Gibbs sampler, data was sampled from several randomly generated undirected 
graphical models, each with a randomly generated graph and parameters. We considered 
models with different number of variables n (e.g., n = 12,50), and different number r of 
neighbors per node (we used r = 1, 2,4, 8 in all our experiments). For a given n and r, 10 
graphs were generated, connecting randomly and uniformly, each of the n nodes to r other 
nodes. This was achieved by connecting each node i with the first r nodes of a random 
permutation of [1, . . . , i— 1, z-|-l, . . . , n]. For each of the generated gra phs, a set of para meters 
was generated randomly following a procedure described in detail in lBromberg et al. I (|200a l 
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that guarantees dependencies remains strong for variables distant in the network. Finahy, 
one dataset was sampled for each of the generated pair of graph and set of parameters. 

5.1.2 Computational cost measure 

One of the hypothesis we wanted to prove with our experiments is the polynomial run- 
time of IBMAP-HC. In Section gXT] we found the time complexity of IBMAP-HC to be 
O {Mn{n - ifNr*). For this expression to be polynomial both M and r* should at most 
grow polynomially with n. The size of the largest triplet test r* depends only in the con- 
nectivity of the networks, which we kept fixed at r. Instead, we have no elements to predict 
the behavior of M, as it depends solely on the landscape of the score function. We thus 
report M in our experiments as a measure of complexity. 

5.1.3 Error measures 

We measured quality through two types of errors between the output network and a model 
considered to be the correct one: the edges and independences Hamming distance. For 
the case of artificial datasets the comparison was done against the true, known, underlying 
network, whereas for real datasets the comparison was done against the data itself. Let us 
discuss these quantities in detail: 

• The edge Hamming distance H-e{G, G') between two graphs G and G' of equal number 
of nodes, represents the number of edges that exist in G, and do not exist in G' , and 
vice versa. Put another way, it measures the minimum number of edge substitutions 
required to change one graph into the other. To measure the error of a structure 
G output by a structure learning algorithm we measure its edges Hamming distance 
He{G*,G), or simply H^{G), with the solution network G*. Formally, if we define 
the following indicator function for evaluating the existence of an edge between two 
variables X and Y in G, 

E''{X,Y) = li iX,Y) is an edge in G 
[ U otherwise. 

the edges Hamming distance is defined as 



X,Y £Y,X ^Y 



E^{X,Y) / E'^'{X,Y) 



(11) 



Given two probability distributions P and P' over the same set of variables V, we 
define the independences Hamming distance Hi{P,P') between them to be the (nor- 
malized) number of matches in a comparison of the independence assertions that holds 
in P and P' . That is, if T denotes the set of all possible triplets over V, it is checked 
for how many triplets t £ T,t is independent (or dependent) in both distributions, and 
then normalized by T. Unfortunately, the size of T is exponential. We thus compute 
the approximate Hamming distance Hi{P,P') over a randomly sampled subset of T, 
uniformly distributed for each conditioning set cardinality. In all our experiments we 
used |T| = 2, 000, constructed as follows: for each m = 0, 1, 2, . . . , n — 2, |T|/(n — 1) 
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triplets {X,Y \ Z) with cardinality \Z\ = m were sampled randomly and uniformly 
by first generating a random permutation [tti, 7r2, . . . , 7r„] of the set of all variables V, 
and the assigning X = tti, Y = -^2, and Z = [ir^, . . . , tt„i] . 

In what follows we are not given a fully specified distribution. Instead, dependending 
on the type of dataset being artificial or real, we are given the independence structure 
or a sample (i.e., a dataset), respectively. Independencies, however, can be measured 
over both independence structures and datasets, so in both cases we can measure 
(approximately) the independence Hamming distance. Let's consider both cases. 

To estimate the error of structures output by our algorithms when run over artificial 
datasets, we measured their independence Hamming distance Hi{G,G*) (or simply 
H^{G)), with the underlying true network G*, querying independences directly on 
the structures using vertex-separation. Formally, if we denote I^* (t) the result of a 
test t ^ T performed on the true model, and by I'^(i) the result of the same test t 
performed on a model G, then the independence Hamming distance Hl{G) is defined 
formally as: 



Ht{G) = 



T 



{teT 



G* 



(12) 



In experiments over real datasets the underlying structure G* is unknown. We thus 
conducted experiments learning the structure over smaller datasets with sizes 1/3 and 
1/5 of the input dataset V, and compared the independencies in the output structure 
G, and the complete dataset V. The resulting Hamming distance is denoted Hi{G, T>), 
or simply H^{G). Formally, if we denote by I^{t) the result of a test t G T performed 
on the complete dataset, then Hf{G) is defined as: 



H?iG) 



1 



r 



{teT 



V, 



(13) 



We are interested in comparing the errors of structures output by our algorithms (IBMAP- 
HC , IBMAP-TS) and GSMN, the competitor. For that we report the ratio r = J^Ggsmn) 
of the errors of the network Ghc output by IBMAP-HC against the error of the network 
Ggsmn output by GSMN. Similarly, for network Gts output by IBMAP-TS, we report 
the ratio r = ^'^Tg) _ Being three different types of errors, the edge Hamming distance 

H{GgSMN) ° -"^ 1 to to 

H^{G) and the two independence Hamming distances H^{G) and Hf{G), this results in 
six possible ratios, three for HG and three for TS: 



\{HG) 



{TS) 



(Ggsmn)' 



r\{HG) 
r\{TS) 



_ J^f(GTs) rfiTS) 



^ ^fff (Ghc) 

f^p (Ggsmn) 
_ (Gts) 



(Ggsmn)' 'iv^^/ H{{Ggsmn)' ' i ^/f (Ggsmn) 

These ratios allows a quick comparison between the two algorithms it involves as a ratio 
equal to one means the same error in the structures they output, a ratio lower than one 
means a reduction in the error of the structures output by our algorithms (HC or TS) and 
a ratio greater than one means a reduction in quality by our algorithms. 
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5.2 Experimental results 

We show now the results of our experiments. 

5.2.1 Sampled Data Experiments 

In our first set of experiments, we demonstrate that our two IBMAP algorithms IBMAP-HC 
and IBMAP-TS are successful in improving the quality over artificially generated datasets 
by comparing their edge errors H^{Giic) and H^{Gts) as well as their independence errors 
Hf{Giic) and H^{Gts) against the corresponding errors -^^(Ggsmn) Hj{Ggsmn) of GSMN. 

As the IBMAP-TS algorithm is impractical for larger domains, we considered two sce- 
narios, one over underlying models of size n = 12, comparing errors of both IBMAP-HC 
and IBMAP-TS against GSMN, and one over larger models of size n = 50, comparing only 
IBMAP-HC against GSMN (with results shown in ). To assess our algorithms over dif- 
ferent conditions of reliability and connectivity, we ran them over datasets with increasing 
number of datapoints = 40, 200, 800, 5000, 12000, sampled from networks of increasing 
node degrees r = 1, 2, 4, 8. To increase statistical significance, 10 datasets were sampled for 
each pair (ra, r), and for each of them, one subsample of size N was obtained by randomly 
selecting N datapoints. 

Table [1] shows the edge errors for n = 12, reporting the mean values and standard 
deviations (in parenthesis) of the edge errors ^^^(Ggsmn), H^{Gts), and H^{Giic) of GSMN, 
IBMAP-TS and IBMAP-HC, respectively, for the different conditions of connectivity and 
reliability. The last two columns show their corresponding ratios r^{TS) and r^{HC), 
indicating in bold the statistically significant improvements (i.e. ratios lower than 1), and 
underlined the statistically significant reductions in errors (i.e., ratios greater than 1). These 
error ratios are plotted also in Figure [5] (left column). The independence errors H^{Ggsmn), 
H^{G'rs) and i7^(GHc) for n = 12 are shown in Table [2] with a similar structure, with the 
ratios also plotted in Figure [5] (right column) . 

These results show that in most cases our proposed algorithms present quality improve- 
ments over GSMN. For datasets with connectivities r = 1,2,4, the Hamming distances are 
better or equal for IBMAP-TS and IBMAP-HC on all cases in Table [U and almost all cases 
in Table [2j In some cases the improvement are drastic (e.g. r^{HC) = 0.058(0.088), for 
T = 4, = 12000 meaning that for approximately 6 wrong edges in Ghc there are 100 
wrong edges in Ggsmn)- The pour results for datasets with connectivity r = 8 can be 
explained by recalling the approximation made in Eq.dSj), namely, that conditional inde- 
pendence assertions are mutually independent. This independence is not expected to hold 
over assertions involving the common variables, with the stronger dependence for larger 
overlaps. 

This tendency is clearer in the results for n = 50, shown in Table [Sj plotted in Figured 
The table shows the mean value and standard deviation (in parenthesis) of edge Hamming 
distance H^{Ggsmn), H^{Guc) and their ratio r^{HC) on the left group of columns, and 
Hi (Ggsmn), Hj{Giic) and their ratio r\{HG) on the right group of columns. Again, bold 
(underline) signifies an increase (decrease) in the quality IBMAP-HC over the quality of 
GSMN. We can observe that for the case of edge Hamming distance, as r increases, the 
number of bold ratios decreases (5,4,2, and 2 for r = 1, 2, 4, 8, respectively), and the number 
of underlined ratios increases (1,2,3, and 2 for r = 1,2,4,8, respectively). 
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Comparison of Hamming distances, n = 12 
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Figure 5: Average of r^{TS), r^{HC) (left column), rj{TS) and r\{HC) (right column) 
over 10 datasets with n = 12 and r = 1,2,4,8 (rows). Error bars represents standard 
deviation. The smaller the value, more important is the improvement, with values greater 
(smaller) than 1 represent reduction (increase) in the quality. 
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Edges Hamming distances, n = 12 
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8.300(1.332) 4.400(1.669) 4.600(1.457) 
4.400(1.734) 2.200(0.986) 0.500(0.499) 
2.400(0.828) 1.500(0.685) 0.200(0.297) 


0.880(0.124) 
0.772(0.157) 
0.553(0.203) 
0.542(0.135) 
0.575(0.245) 


1.073(0.098) 
0.947(0.088) 
0.543(0.136) 
0.183(0.221) 
0.058(0.088) 


8 


40 
200 
800 
5000 
12000 


30.600(4.475) 33.600(3.261) 34.300(2.702) 
23.200(2.634) 25.600(3.086) 28.900(3.414) 
17.400(2.402) 19.900(2.648) 25.000(2.532) 
10.300(1.489) 13.600(1.494) 16.400(1.601) 
9.200(2.913) 12.200(2.613) 12.900(2.810) 


1.116(0.082) 
1.112(0.103) 
1.152(0.089) 
1.362(0.230) 
1.469(0.356) 


1.145(0.100) 
1.250(0.079) 
1.462(0.132) 
1.631(0.198) 
1.520(0.290) 



Table 1: Average and standard deviation of ^^^(Ggsmn), H^{Gts) and H^{Giic) (columns) 
for datasets sampled from 10 random networks of n = 12 and r = 1, 2, 4, 8 (rows). Last two 
columns shows r^{TS) and r^{HC), with quality improvements when r < 1 (in bold), and 
quality reductions when r > 1 (underlined). 



The plots in Figure [6] show clearly how both ratios r^{HC) (left) and r\[HC) (right) 
decrease with A^, with the decrease being slower as r increases. Although for small A^s there 
are some pour results for edge errors, these ratios are never greater than 1.231 (for r = 4, 
A^ = 500), and in most cases has a corresponding improvement in the independence error. 
For instance, for the same case of r = 4, A^ = 500, the independence ratio is 0.926(0.061), 
a value smaller than 1 with statistical significance. In all other cases, only the four cases of 
A^ = 100 show no improvement, with only the case of r = 4 showing an increase in error. 
The remaining cases of edge errors show significant improvements reaching in many cases 
of large A^s ratios smaller than 0.1, with proportions of 50 to 1 wrong edges in GSMN and 
IBMAP-HC, respectively, for r = 4, A^ = 12000. 

To conclude this section we present results for M, the number of hill climbs conducted 
by IBMAP-HC. We measured this quantity for runs of the algorithms on datasets sampled 
from graphs of increasing size n, connectivities r = 1,2,4,8, and three different reliability 
conditions A^ = 200, 1000,5000. Figure [7] shows four plots, one per r, with 3 curves each, 
one per dataset size A^. 
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Independences Hamming distances, n = 12 


T 


N 


-'^i*(G'gsmn) H^{Gts) H^(G-ac) 


r*{TS) 


r*{HC) 


1 


40 
200 
800 
5000 
12000 


0.593(0.058) 0.293(0.048) 0.322(0.098) 

0.164(0.096) 0.084(0.054) 0.020(0.013) 
0.027(0.032) 0.036(0.044) 0.012(0.015) 
0.005(0.006) 0.004(0.005) 0.000(0.000) 


0.497(0.074) 

0.565(0.267) 

1.132(0.276) 
0.916(0.229) 


0.541(0.139) 
n ^Qfi('n mi 
0.235(0.247) 

0.766(0.280) 
0.700(0.341) 


2 


40 

200 

800 

5000 

12000 


0.444(0.031) 0.379(0.052) 0.376(0.058) 
0.297(0.078) 0.208(0.045) 0.254(0.066) 
0.166(0.040) 0.094(0.050) 0.045(0.036) 
0.085(0.054) 0.062(0.042) 0.006(0.012) 
0.030(0.032) 0.031(0.033) 0.000(0.000) 


0.858(0.118) 
0.742(0.153) 
0.507(0.199) 
0.777(0.228) 

1.000(0.025) 


0.843(0.111) 
0.904(0.197) 
0.273(0.220) 
0.200(0.247) 
0.500(0.372) 


4 


40 

200 

800 

5000 

12000 


0.197(0.032) 0.296(0.034) 0.272(0.033) 
0.127(0.020) 0.134(0.045) 0.152(0.025) 
0.094(0.018) 0.056(0.022) 0.052(0.017) 
0.051(0.023) 0.026(0.011) 0.007(0.009) 
0.027(0.011) 0.022(0.012) 0.002(0.003) 


1.555(0.244) 
1.071(0.343) 
0.597(0.193) 
0.596(0.168) 

0.800(0.532) 


1.426(0.225) 
1.238(0.278) 
0.534(0.110) 
0.203(0.234) 
0.060(0.093) 


8 


40 

200 

800 

5000 

12000 


0.238(0.145) 0.291(0.128) 0.271(0.132) 
0.094(0.013) 0.142(0.018) 0.125(0.016) 
0.059(0.006) 0.095(0.012) 0.097(0.010) 
0.031(0.003) 0.044(0.005) 0.052(0.004) 
0.024(0.007) 0.033(0.008) 0.036(0.008) 


1.382(0.196) 
1.531(0.166) 
1.606(0.142) 
1.453(0.207) 
1.488(0.326) 


1.246(0.132) 
1.343(0.083) 
1.644(0.102) 
1.713(0.182) 
1.600(0.307) 



Table 2: Average and standard deviation of H^{Ggsmn), H^iG-rs) and H^{Giic) (columns) 
for datasets with n = 12 and r = 1,2,4,8 (rows). Last two columns shows r'^{TS) and 
rj{HC), with quality improvements when r < 1 (in bold), and quality reductions when 
r > 1 (underhned). 



In all cases, M presents a quasi-linear grow on n. After incorporating M as a a linear 
function on n into the expression 0{Mn{n - ifNr*) for the complexity of the IBMAP-HC 
algorithm (c.f. i i5.1.2p . we obtain empirically the runtime grows as O(n^). 

To give the reader a sense of the runtime, we report the runtime for the hardest scenario. 
In a Java virtual machine running on an AMD Athlon processor of 64 bits and 4800 Mhz 
and 1 Gb of RAM, the case for n = 50, N = 12000, and r = 8 took 5 days, with a reduction 
to 5 hours after implementing a simple cache for avoiding the computation of repeated tests. 

5.2.2 Experiments on benchmark datasets 

To conclude the discussion of our experiments we compare quality performances of IBMAP- 
HC against GSMN on benchmark datasets. For this datasets the underlying true network 
is unknown, and we are thus restricted to independence Hamming distances measured on 
the dataset. 
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Comparison of H^{Gyic)i ^^ = 50 
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Figure 6: r^{HC) (left) and r\{HC) (right) for datasets with n = 50 and r = 1,2,4,8 
(color-coded) ran on datasets with increasing number of datapoints N . Bars represents 
average over ten different datasets. Error bars represents standard deviation. 



Growth in the number of ascents of IBMAP-HC 
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Figure 7: Number of IBMAP-HC ascents M on datasets with increasing number of variables 
re, connectivities r = 1,2,4,8 (four plots), N = 200,1000,5000 (three curves). 



We ran the IBMAP-HC and GSMN algorithms ten ti mes on each of a set of datasets 
obtained from the UCI Reposit ories of machine learning (|A. Asuncionl . 120071 ) and KDD 
datasets ( Hettich and Bav . 19991 ). Each ran on a different randomly sampled subset of the 
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n = 50 


T 


N 


Edges Hamming distances 


Independences Hamming distances 


-'^e(Ggsmn) 




r*iHC) 


-f^I*(G'GSMN) 


^^i*(G'hc) 




1 


100 
500 
1000 
2000 
5000 
12000 


193.200(4.255) 
156.500(9.919) 
91.200(10.771) 
55.000(12.296) 
22.200(3.758) 
13.400(2.514) 


205.800(3.133) 
145.200(16.865) 
37.300(18.338) 
1.556(0.422) 
1.700(0.943) 
0.800(0.728) 


1.066(0.026) 
0.923(0.063) 
0.389(0.160) 
0.030(0.009) 
0.071(0.040) 
0.055(0.049) 


0.789(0.010) 
0.720(0.023) 
0.556(0.034) 
0.389(0.075) 
0.163(0.034) 
0.098(0.020) 


0.770(0.022) 
0.447(0.075) 
0.189(0.072) 
0.004(0.002) 
0.005(0.003) 
0.002(0.002) 


0.977(0.029) 
0.616(0.089) 
0.333(0.122) 
0.010(0.004) 
0.029(0.015) 
0.018(0.016) 


2 


100 
500 
1000 
2000 
5000 
12000 


186.200(4.370) 
153.500(7.465) 
117.100(11.527) 
81.400(10.540) 
35.700(4.595) 
20.600(3.328) 


206.200(3.959) 
163.900(12.135) 
93.200(21.019) 
14.500(13.967) 
0.900(0.966) 
0.600(0.595) 


1.108(0.025) 
1.067(0.055) 
0.778(0.108) 
0.150(0.137) 
0.024(0.025) 
0.034(0.037) 


0.605(0.014) 
0.566(0.022) 
0.502(0.027) 
0.433(0.030) 
0.273(0.028) 
0.163(0.019) 


0.615(0.020) 
0.435(0.036) 
0.230(0.035) 
0.067(0.055) 
0.005(0.005) 
0.003(0.004) 


1.016(0.020) 
0.767(0.044) 
0.457(0.059) 
0.146(0.119) 
0.016(0.017) 
0.022(0.026) 


4 


100 
500 
1000 
2000 
5000 
12000 


167.727(9.113) 
128.700(5.097) 
118.900(4.315) 
106.200(6.297) 
85.800(7.594) 
51.800(6.200) 


199.818(6.557) 
158.200(7.175) 
140.900(5.791) 
98.900(12.825) 
35.100(14.526) 
0.900(1.348) 


1.195(0.033) 
1.231(0.056) 
1.186(0.034) 
0.928(0.089) 
0.395(0.142) 
0.015(0.022) 


0.257(0.012) 
0.258(0.012) 
0.243(0.009) 
0.219(0.008) 
0.205(0.012) 
0.145(0.011) 


0.286(0.012) 
0.238(0.016) 
0.200(0.014) 
0.129(0.021) 
0.045(0.018) 
0.002(0.002) 


1.115(0.047) 
0.926(0.061) 
0.823(0.053) 
0.585(0.081) 
0.218(0.085) 
0.012(0.015) 


8 


100 
500 
1000 
2000 
5000 
12000 


296.000(7.433) 
261.900(4.707) 
237.800(9.145) 
244.222(13.415) 
263.000(19.221) 
232.100(23.718) 


298.600(9.795) 
276.700(4.760) 
255.700(8.231) 
253.889(10.707) 
225.000(9.314) 
188.400(12.461) 


1.009(0.024) 
1.057(0.024) 
1.077(0.042) 
1.042(0.045) 
0.860(0.047) 
0.822(0.071) 


0.157(0.004) 
0.137(0.003) 
0.134(0.004) 
0.125(0.005) 
0.127(0.006) 
0.112(0.007) 


0.150(0.006) 
0.130(0.004) 
0.119(0.002) 
0.115(0.005) 
0.102(0.003) 
0.089(0.005) 


0.961(0.040) 
0.947(0.032) 
0.888(0.033) 
0.919(0.041) 
0.807(0.036) 
0.793(0.056) 



Table 3: Experiments results of edge (left) and independence (right) Hamming distances 
for several datasets with n = 50, r = 1,2,4,8 and increasing number of data points 
(rows). Standard deviations are shown in parenthesis. Again, quality improvements are in 
bold, and quality reductions are underlined. 



whole dataset of 1/3 its size. The same was repeated for subsets of 1/5 the total size. The 
independence Hamming distances H^{Guc), and H^{G qsmn) where then computed com- 
paring the independencies of the output networks with the independencies of the complete 
dataset. 

Results of these experiments are shown in Tableland Figure [8] (left) for the 1/3 case, 
and Tableland Figure [8] (right) for the 1/5 case. On both figures, the numbers in the x- 
axis represent the indexes of each dataset in the corresponding table. The tables also show 
on their last two columns values for r^{HC) and M. These results show improvements 
in most cases, again, shown in bold in the tables and bars lower than one in the figures. 
Quality reductions are underlined in both tables. 
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Comparison of independence Hamming distances 
for benchmark data sets 
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Figure 8: Experimental results for real datasets. The table shows average and standard 
deviation of rf{HC) over 10, 1/3 subsets of P (left) and over 10, 1/5 subsets (right). Again, 
smaller values mean greater improvements. 



# 


name 


n 


N 


-fff (Ggsmn) 




r?{HC) 


M 


1 


bridges.csv 


12 


140 


0.445(0.018) 


0.450(0.020) 


1.012(0.046) 


7.700(1.105) 


2 


dependent. CSV 


3 


1000 


0.000(0.000) 


0.000(0.000) 


1.000(0.000) 


1.000(0.000) 


3 


flare2.csv 


13 


1065 


0.203(0.009) 


0.204(0.016) 


1.006(0.074) 


10.200(2.298) 


4 


haberman.csv 


5 


612 


0.309(0.029) 


0.309(0.024) 


1.003(0.040) 


1.500(0.372) 


5 


imports-85.csv 


25 


386 


0.601(0.006) 


0.602(0.007) 


1.001(0.014) 


20.667(1.942) 


6 


balance-scale.csv 


5 


1250 


0.421(0.015) 


0.412(0.013) 


0.980(0.024) 


1.100(0.223) 


7 


car. CSV 


7 


3456 


0.425(0.006) 


0.412(0.006) 


0.968(0.016) 


2.000(0.412) 


8 


monks- 1. CSV 


7 


1112 


0.091(0.011) 


0.091(0.031) 


0.952(0.158) 


1.182(0.393) 


9 


dermatology, csv 


35 


357 


0.724(0.008) 


0.687(0.025) 


0.949(0.028) 


69.556(6.525) 


10 


glass. CSV 


10 


210 


0.449(0.029) 


0.418(0.014) 


0.939(0.069) 


5.700(1.054) 


11 


nurscry.csv 


9 


25920 


0.551(0.035) 


0.491(0.024) 


0.895(0.053) 


4.000(0.940) 


12 


machine. CSV 


10 


201 


0.411(0.044) 


0.341(0.015) 


0.839(0.083) 


6.750(1.088) 


13 


haycs-roth.csv 


6 


264 


0.358(0.056) 


0.291(0.061) 


0.816(0.133) 


3.600(1.531) 


14 


tictactoc.csv 


10 


1916 


0.584(0.021) 


0.476(0.021) 


0.815(0.024) 


9.500(1.459) 


15 


baloons.csv 


5 


20 


0.324(0.113) 


0.244(0.087) 


0.806(0.164) 


1.900(0.617) 


16 


lenses. CSV 


5 


48 


0.306(0.080) 


0.145(0.046) 


0.545(0.223) 


2.200(0.446) 



Table 4: Real data independence Hamming errors experiments for a 1/3 of the complete 
dataset. Quality improvements when r < 1 (in bold), and quality reductions when r > 1 
(underlined). 



These results show again quality improvements of IBMAP-HC over GSMN, with only 
two cases, lenses and flare2 in the 1/5 case showing reductions. Improvements occurred in 
the 8 out of 16 in both cases. 
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# 


name 


n 


N 


f Gr^ Q^/rl^T ) 




rViHC) 

I \ 1 


M 


1 


lenses. CSV 


5 


48 


0.241(0.076) 


0.294(0.093) 


1.225(0.174) 


1.700(0.341) 


2 




5 


20 


'\7R((^ AVi7\ 




1 1 7f\(C\ 


1 1 nrnT) f<AA\ 


Q 
O 


ficire2.csv 






1 QC\(C\ C\C\A\ 

U. i-^Kj\Kj .yjKj'-t ) 




1 OQQ/'n C\(\C\\ 


Q '^00/' 1 7Q9~1 
y .ouui^ 1 . / yz ^ 


A 

4 


bridges. CSV 


1 o 


1 /I n 


n ACiA i n noo\ 
U.4U4(^U.UzZ J 


n /in'7/'n ^o'7^ 
U.4U / \ I ) 


i.Uill^U.UOoJ 


n nnn/ 1 oq'7\ 
y.UUUf^i.zo/ j 


O 


dependent . csv 


n 
O 


iUUU 


U.UUU^U.UUU J 


n nnn/^n ^^^^ 
U.UUUl^U.UUU ) 


1 r\c\r\l r\ ^^^^ 
i.UUUl^U.UUUJ 


1 c\c\c\(c\ nnn\ 
l.UUU^U.UUU j 


D 


glass. CSV 


1 n 
iU 




U.o0o(^U.U41 ) 


U.vS4z(^U.Ulo J 


n OQn/'n i OQ^ 
U.yoU(^U. iZo J 


/ .oUU(U.y4o j 


7 


machine. CSV 


10 


201 


0.357(0.047) 


0.336(0.024) 


0.967(0.130) 


5.400(0.595) 


Q 
O 


liaberman. CSV 


c: 
o 


612 






n Qfi'^/'n OziQ^ 
u. yuoi^u.u^y j 


-L.yuu\^u.ui ( ) 


9 


dermatology. CSV 


35 


357 


0.709(0.006) 


0.653(0.012) 


0.922(0.021) 


58.100(4.490) 


10 


imports-85.csv 


25 


386 


0.550(0.024) 


0.495(0.016) 


0.902(0.027) 


19.900(2.290) 


11 


balance-scale. CSV 


5 


1250 


0.422(0.033) 


0.366(0.048) 


0.871(0.107) 


1.700(0.581) 


12 


nursery. CSV 


9 


25920 


0.469(0.029) 


0.405(0.028) 


0.868(0.066) 


4.100(0.966) 


13 


car. CSV 


7 


3456 


0.386(0.034) 


0.331(0.040) 


0.860(0.066) 


3.000(0.940) 


14 


tictactoc.csv 


10 


1916 


0.584(0.021) 


0.476(0.021) 


0.815(0.024) 


9.500(1.459) 


15 


monks- 1. CSV 


7 


1112 


0.126(0.041) 


0.096(0.018) 


0.812(0.094) 


1.300(0.341) 


16 


hayes-roth.csv 


6 


264 


0.302(0.062) 


0.227(0.040) 


0.782(0.120) 


1.900(0.844) 



Table 5: Real data independence Hamming errors experiments for a 1/5 of the complete 
dataset. Quality improvements when r < 1 (in bold), and quality reductions when r > 1 
(underlined) . 



6. Conclusions and future work 

In conclusion, the IBMAP-HC algorithm provides the possibility of solving the problem for 
complex systems getting significant quality improvements over GSMN. IB-score seems to 
be a good and efficient likelihood function. 

Prom this work, several future research possibilities arise that we are motivated to pur- 
sue, including continuing to look into a practical algorithm with lower computational cost 
(to be used with even more complex systems), experience with other approximate opti- 
mization algorithms such as local beam Metropolis Hastings, continue with the analysis of 
quality measures and propose new, more efficient and reliable scoring functions, and per- 
form a thorough comparison of these novel scoring functions and existing scoring functions 
such as the likelihood of the data given the complete model (structure plus parameters). 
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Appendix A. Proof of Lemma [3] 

All along this work we made the running assumption of graph-isomorphism, that is, that 
the underlying distribution we are trying to learn has a graph that encodes all and only the 
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independencies that holds in the distribution. According to Theorem 2 of Pear] (| 19881 ). 



p. 97], the following properties among independence hold in any graph-isomorph distribution: 

• Intersection: /(X, Y \Z,W) h I{X, W \ Z,Y) ^ I{X; Y,W\Z) 

• Decomposition: I{X; Y,W \ Z) ^ I{X, Y \ Z) A I{X, W \ Z) 

• Strong Union: I{X, Y\Z)^ I{X, Y \Z,W) 

Before proving the main lemma of the appendix we present an auxiliaty lemma. 

Auxiliary Lemma 5. For all X ^ Y ^ W £ \ and Z (^Y - {X, Y, W}, 

^i{x, y I Z) A I{X, Y \Z,W)^ ^I{X, w I Z) 

Proof By Intersection and decomposition, 

I{X,Y \Z,W) AI{X,W \Z,Y) =^ I{X;Y,W\Z) 

I{X,Y\Z). 

Then, by the counter-positive of this expression and the counter-positive of Strong Union, 

-^I{X,Y \Z) AI{X,Y \Z,W) -^I{X,W\Z,Y) 

-^I{X,W\Z). 



We can now prove Lemma O reproduced here for convenience: 
Lemma [3l For every W ^ X ^ V , and the boundary of X, 

W (^B^ ^ I{X, lU I - {X, W}) 

Proof 

We prove its counter-positive W € ^ ^ I ~ {^i ^})- By minimality of 

B^, there must be a variable Y ^ B"'^ such that removing W from the blanket, Y becomes 
dependent of X, i.e., -^I{X,Y \ B^ - {W} - {X,Y}). Also, since Y ^ B^, by the defini- 
tion of boundary of Eq. ([5]) it holds that I{X, Y \ B^ — {X, Y}). The lemma follows then 
from Auxiliary Lemma[5]by letting Z = B^ — {X, y} — {W}. ■ 
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